Software graphs and programmer awareness

Dependencies between types in object-oriented software can be viewed as directed graphs, with
types as nodes and dependencies as edges. The in-degree and out-degree distributions of such graphs
have quite different forms, with the former resembling a power-law distribution and the latter an
exponential distribution. This effect appears to be independent of application or type relationship. A
simple generative model is proposed to explore the proposition that the difference arises because the
programmer is aware of the out-degree of a type but not of its in-degree. The model reproduces the
two distributions, and compares reasonably well to those observed in 14 different type relationships
across 12 different Java applications.

I. INTRODUCTION

Modern computer programs are large and highly struc- 
tured entities. Naturally the components of a program 
depend on each other, and in general if we think of the 
program components as nodes and dependencies as edges, 
a directed graph can be constructed. There is not a single 
graph for each program, as there are many different ways 
that 'node' and 'edge' might be defined. For example, for 
a given object-oriented program, we might construct the 
graph in which nodes are top-level types and an edge from 
type a to type b indicates that type a has a field of type
b. Thus the number of types of fields declared in a can be 
considered as the 'out' degree of a, while the number of 
types having fields of type a is its 'in' degree. As another 
example, two top-level types could be linked when one 
contains a method (out) with a parameter of the other's 
type (in) . These different ways of constructing the graph 
will be referred to as different metrics. 
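
As an illustration of the first construction described above (nodes are types, edges are "has a field of type"), the following minimal sketch counts both degrees for a toy program. The type names and the `fields_by_type` input format are hypothetical and are not taken from the corpus; only types declared in the input are counted, which loosely mirrors the separation of application code from library code.

```python
from collections import defaultdict


def degree_counts(fields_by_type):
    """Return (out_degree, in_degree) for a 'has a field of type' graph.

    fields_by_type maps each declared type name to the list of types of
    its fields. Repeated field types count once, since the metric counts
    distinct types rather than individual fields.
    """
    out_degree = {}
    in_sources = defaultdict(set)
    for owner, field_types in fields_by_type.items():
        distinct = set(field_types)
        out_degree[owner] = len(distinct)      # distinct field types in owner
        for target in distinct:
            in_sources[target].add(owner)      # owner refers to target
    # In-degree: number of declared types that have a field of this type.
    in_degree = {name: len(in_sources[name]) for name in fields_by_type}
    return out_degree, in_degree


# Hypothetical toy program: type name -> types of its declared fields.
toy = {
    "Order": ["Customer", "Item", "Item"],
    "Customer": ["Address"],
    "Item": [],
    "Address": [],
    "Invoice": ["Order", "Customer"],
}
out_deg, in_deg = degree_counts(toy)
print(out_deg)
print(in_deg)
```

Each edge contributes one unit to the out-degree of its source and one unit to the in-degree of its target, so the two totals agree whenever every referenced type is itself declared in the input.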

The distributions of in-degree and out-degree show a 
clear difference in form. This dimorphism was observed 
in a range of graphs generated from the source code of a 
large corpus of open-source Java software in [1]. It was
found that the in-degree distributions were well fitted 
by power-law distributions, which appear as a straight 
line when plotted on logarithmic axes. The out-degree 
distributions, on the other hand, are noticeably curved 
on a log-log plot.

This pattern appears regardless of the metric used or
the application of the software examined [1, 2], and even
appears in other kinds of software-derived graph structures [1].
Software code is a direct product of the actions of programmers,
and therefore the resulting 'shapes' of the code result from these
actions. It is clear that the difference between in-degree and
out-degree distributions is not accidental, but is in some way a
result of the way in which nodes and edges are created as the
program is written. Because this basic shape is observed in such a
variety of software and metrics, the mechanism must be
quite general, and cannot depend on any specific features
of the way different kinds of dependencies are created or
different design methodologies are used to write software.

*Electronic address: gareth.baxter@mcs.vuw.ac.nz

While we may be able to characterize the different dis- 
tributions by fitting functions of different kinds, and ex- 
amining the parameters of the fitted functions for trends 
and patterns - for example, the exponent of a power law 
distribution - such descriptive models can never explain 
what we see at any deep level. We really want to know 
not just what shapes software has, but how these shapes 
come about. One explanation that has been suggested is that
the out-degree is more actively controlled by the pro- 
grammer. The outgoing edges of a type consist of refer- 
ences directly coded as the type is written, while for the 
in-degree, references to a type are created as other types 
are written. 

In this paper, we propose a simple generative model 
based on this observation which reproduces features ob- 
served in real software graph degree distributions. The 
model aims to capture the simplest actions of a program- 
mer with respect to type and dependency creation. New 
edges are added between existing nodes of the graph, 
and a new node can be created by the division of an ex- 
isting node into two parts. The treatment of incoming 
and outgoing connections between nodes is symmetrical 
in every way except that the division of nodes depends 
on the out-degree of the parent node in a specific way, 
but is independent of the node's in-degree. This rep- 
resents the programmer's awareness of the outgoing de- 
pendencies but not of the code elsewhere in the program 
which refers to the current type. It is found that this 
single asymmetry is sufficient to reproduce the difference 
in shape between the two degree distributions observed 
in real software. The model proposed is similar to that
developed by Price to explain citation rates for papers.
However, whereas Price's model produces power-law
distributions for both in-degree and out-degree, the
introduction of the splitting step converts the
out-degree distribution to an exponential distribution.

We consider the degree distributions of a variety of 
inter-class relationships in the Java source code of 12 dif- 
ferent applications. These 12 programs are the largest 
of the 50 studied in [1]. The actual programs (and version
numbers) used are listed in Appendix B. Each metric
counts, for each type, a certain kind of dependency on
other types. We consider 14 metrics that can be identified
as measuring either an out-degree or in-degree. Metrics
used in [1] which could not be classified unambiguously
as in- or out-degree distributions were left out. The metrics
are defined in Appendix A and each is referred to
by a short abbreviation. Some pairs of metrics register
opposite ends of the same relationship, suggesting that 
they are the out- and in- degree distributions of the same 
graph. However, considerations such as the separation 
of application code from shared libraries mean that the 
count of outgoing edges does not always match the to- 
tal of incoming edges. We will not go into detail about 
the differences between the metrics, as our aim is simply 
to demonstrate the extent to which the model described 
below reproduces the patterns observed in these various 
data. 

Plots of the 5 'in' metrics (see Appendix A) on logarithmic
axes often have a linear form, suggesting that a power-law
distribution is typical for graph in-degree distributions,
regardless of the specific metric used or the particular
program. Three examples are shown in Fig. 1, which
shows the degree distributions for three different 'in' metrics
(see Appendix A) in three different programs of differing
sizes. The 9 'out' metrics (see Appendix A) do
not appear linear on doubly logarithmic axes; however, a
number of the plots do have a linear shape when plotted
on linear-logarithmic axes, suggesting an exponential
distribution.
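
The visual test described above (straight on double-logarithmic axes for a power law, straight on linear-logarithmic axes for an exponential) can be made slightly more concrete. The sketch below is only an illustration on synthetic, noise-free curves: it compares the squared correlation coefficient of a straight-line fit on each choice of axes. The helper names and the exponents used are assumptions for the example, and this crude diagnostic is not the fitting procedure used later in the paper.

```python
import math


def r_squared(xs, ys):
    """Squared Pearson correlation of ys against xs (1.0 = perfectly linear)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy * sxy / (sxx * syy)


def axis_diagnostics(degrees, freqs):
    """Linearity of the same data on log-log and on linear-log axes."""
    log_f = [math.log(f) for f in freqs]
    log_d = [math.log(d) for d in degrees]
    return {
        "log-log": r_squared(log_d, log_f),      # straight for a power law
        "linear-log": r_squared(degrees, log_f),  # straight for an exponential
    }


degrees = list(range(1, 31))
power_law = [d ** -2.0 for d in degrees]            # synthetic, exponent 2
exponential = [math.exp(-0.3 * d) for d in degrees]  # synthetic, rate 0.3

pl = axis_diagnostics(degrees, power_law)
ex = axis_diagnostics(degrees, exponential)
print(pl)
print(ex)
```

On real histograms the counts are noisy and the tail is sparse, so a straight-line fit of this kind is at best a quick screening step.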

Not all the data sets conform clearly to this pattern, 
having a slightly different shape or one or more points 
which do not fall on a neat curve. Nevertheless, the 
general pattern for in-degree is power-law like, and for 
out-degree seems to be an exponential shape. This sug- 
gests there is a common underlying process, modified to 
a greater or lesser degree by specific programmer actions 
in each case. 

Software graph structures have been examined in several
recent studies. In particular, several have reported
power-law like degree distributions in graphs derived from
source code, or from object relationships at run-time. A
distinction between in-degree and out-degree distributions
has been observed in graphs derived from C and C++ software
by Myers [3], who treated both as approximate power-law
distributions, and by Valverde and Solé [5], who, in common
with the present study of Java software, characterized the
in-degree as a power-law and the out-degree as an exponential
distribution. They showed that these distributions
can be generated by certain cases of the GNC ('growing
network model with copying') network growth model,
although the power law distribution generated by
this model has a fixed exponent of 2. Yan, Qi and Gu
examined Java applications, constructing the directed
graph of 'import' relationships. Once again they note
that the in-degree [3] distribution typically resembles a
power-law while the out-degree has a largely exponential
behavior. Concas et al. also studied a Java application,
and noted a difference between in-degree and out-degree
distributions. These observations are confirmed
by the present study of a much larger group of Java
applications and metrics. Yan et al. also postulate a
generative model for such distributions, but make no claims
about its plausibility in relation to programmer actions.



II. MODEL 

As programmers write software code, they will peri- 
odically add references between types (edges), and occa- 
sionally create new types (nodes). It is these aspects of 
the code we are interested in, so the intervening code, 
which is actually the majority of the program and speci- 
fies all its functions, is ignored. We consider a simplified 
process, each step of which entails either the addition of 
an edge between two existing nodes or the addition of a 
new node. 

Generally there are a few elements that have many 
references to other parts of the program (these might 
be the most complex types), while there are many that 
reference only a few other types. This divergence can be 
approximated by ensuring that "the rich get richer",
invoking the 'cumulative advantage' mechanism of Price.
New outgoing edges are added to the type which
a programmer is 'currently working on', and we consider 
that the larger types (those with most out-edges already) 
are more likely to be added to in future. At each step a 
new edge is added, and the node the link originates from 
is chosen with a probability proportional to its current 
out-degree. 

Conversely, there are a few elements that are refer- 
enced repeatedly by many of the other parts of the pro- 
gram (we might think of these as the simplest and most 
universal elements), while there are many that are used 
only a few times (these might be more complex ele- 
ments, at a higher level of hierarchy, and therefore are 
less reusable). We consider the number of incoming
dependencies a type has to be an approximation to its
usefulness. Therefore the node to which a new edge will
be linked is chosen with a probability proportional to its 
current in-degree. 

The programmer is conscious of the number of outgoing
references he or she is adding to a node, and at some
point may decide the type is 'too big' and create a new
one. This effect is represented by allowing new nodes
to occasionally be created by 'splitting' an existing node
into two pieces. The edges attached to the original node
are divided between the two resulting nodes, with each
possible division of incoming and outgoing edges between
the two nodes equally likely, subject to the constraint that at
least one outgoing edge must be transferred to the new
type, and one incoming edge must remain in the original
type. Finally, if we think of the new node as carrying out
a subset of the 'task' originally intended for the parent
node, the two nodes must be connected, so a new edge
is created from the original type to the new type. This
also ensures that at all times every node has at least one
incoming edge and one outgoing edge.

FIG. 1: Some examples of degree distributions for 'in' metrics that appear to have a power-law distribution. Both axes are
logarithmic. Power law distributions appear straight on these axes.

FIG. 2: Some examples of degree distributions for 'out' metrics that appear to have an exponential distribution. The x-axis is
linear, the y-axis is logarithmic. Exponential distributions appear straight on these axes.

Let $v_i$ be the out-degree of node $i$, and $w_i$ be its
in-degree, with $i$ running from 1 up to $k$, the current number
of nodes. All of these quantities can grow as the process
proceeds. Let $t$ be the number of steps carried out so
far. Since exactly 1 edge is added at each step,
$\sum_i v_i = \sum_i w_i = t$. The process is initiated at $t = 1$ with a
single node (type) with a single reference to itself (this is
necessary, as links are only added to nodes which already
possess links), i.e. $k = 1$ and $v_1 = w_1 = 1$. At each step:

• Select parent node $m$ with probability $v_m / t$.

• With probability $(1 - \gamma)$ simply add an edge:

o the parent node $m$ is the out node,
o select in node $n$ with probability $w_n / t$,
o $v_m \to v_m + 1$ and $w_n \to w_n + 1$.

• Otherwise, with probability $\gamma$, split the parent
node:

o add a new node $k + 1$ (the last node number
increments from $k \to k + 1$),

o choose $r$ uniformly from $\{1, \ldots, v_m\}$ and $s$
uniformly from $\{0, \ldots, w_m - 1\}$ [15], then

$v_{k+1} \to r$, $v_m \to v_m - r + 1$;
$w_{k+1} \to s + 1$ and $w_m \to w_m - s$.

• Increment the counter $t \to t + 1$.
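
The steps above can be sketched as a short simulation. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function name `simulate`, the fixed random seed, and the use of a plain weighted draw over all nodes (which costs time proportional to the number of nodes at every step) are choices made here for clarity.

```python
import random


def simulate(t_final, gamma, seed=0):
    """Grow a graph by the edge-addition / node-splitting process.

    Returns two lists (v, w): v[i] is the out-degree and w[i] the
    in-degree of node i. Only degrees are tracked, not individual edges.
    """
    rng = random.Random(seed)
    v = [1]          # a single node ...
    w = [1]          # ... with a single reference to itself
    t = 1
    while t < t_final:
        # Parent node chosen with probability v_m / t.
        m = rng.choices(range(len(v)), weights=v)[0]
        if rng.random() < gamma:
            # Split: the new node takes r out-edges and s in-edges,
            # plus one new edge from the parent to the new node.
            r = rng.randint(1, v[m])
            s = rng.randint(0, w[m] - 1)
            v.append(r)
            v[m] = v[m] - r + 1
            w.append(s + 1)
            w[m] = w[m] - s
        else:
            # Add an edge from m to a node n chosen with probability w_n / t.
            n = rng.choices(range(len(w)), weights=w)[0]
            v[m] += 1
            w[n] += 1
        t += 1
    return v, w


v, w = simulate(t_final=5000, gamma=0.3)
print("nodes:", len(v), "edges:", sum(v), "k/t:", len(v) / sum(v))
```

Both branches add exactly one unit to the total out-degree and one to the total in-degree, so the totals stay equal to the step counter, and no degree can fall below one.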

These steps are repeated for some predetermined number
of steps $t_{\rm final}$. The entire simulation is defined by
only two parameters: the total number of links required
(equal to the number of simulation steps), $t_{\rm final}$, and
the splitting probability, $\gamma$. Since new types only appear
due to the splitting process, $\gamma$ determines the ratio
between the number of types and the number of links:
$k/t \to \gamma$ as $t \to \infty$. Note that although nodes to be
'worked on' are selected according to their out-degree,
this model is actually symmetric with respect to in- and
out-degree, except for the splitting step: nodes acquire
outgoing edges at a rate proportional to their existing
out-degree, and acquire incoming edges at a rate proportional
to their in-degree. In this model edges are added
one by one to different parts of the graph, so all the
types grow at the same time. New nodes are also created
during this process. This doesn't necessarily reflect the
actual order in which program code is written.
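
The limiting ratio of types to links quoted above can be checked in one line. Assuming, as in the process described, that the graph starts with one node at $t = 1$ and that every subsequent step creates a new node independently with probability $\gamma$, the expected number of nodes is

```latex
\langle k(t) \rangle = 1 + \gamma\,(t - 1),
\qquad
\frac{\langle k(t) \rangle}{t} = \gamma + \frac{1 - \gamma}{t}
\;\longrightarrow\; \gamma \quad (t \to \infty).
```

The correction term $(1-\gamma)/t$ is a start-up effect of the single initial node and is negligible for graphs of the sizes considered here.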

FIG. 3: Cumulative degree distribution of 'out' metrics (as labeled, see Appendix A for definitions) for netbeans-4.1 with best
fit model out-degree distribution (solid gray line) as described in Section III. Also shown for comparison is the best fit model
in-degree distribution (dashed line). Horizontal scale is logarithmic and vertical scale is linear.

FIG. 4: Cumulative degree distribution of 'in' metrics (as labeled, see Appendix A for definitions) for the largest program studied,
netbeans-4.1, with best fit model in-degree distribution (dashed gray line) as described in Section III. Also shown for comparison
is the best fit model out-degree distribution (solid line). Horizontal scale is logarithmic and vertical scale is linear.

FIG. 5: Top row: Degree distribution for AC ('out' metric) for three different programs (as labeled, see Appendix), on
logarithmic-linear axes. Bottom row: Degree distribution for AP (the reciprocal 'in' metric) for the same three programs,
on logarithmic-linear axes.

Although the graph continues to grow, after a sufficient
number of steps the relative degree frequencies,
normalized by the total number of nodes, reach an equilibrium
distribution. Let $C_m$ be the number of types with
out-degree $m$ after step $t$. Considering the two processes
involved in the model, $C_m$ can increase by 1 if an outgoing
edge is added to a type with out-degree $m - 1$ and
it is not split, or if a type with out-degree greater than $m$
is split at just the right place that one of the resulting
types has out-degree $m$. Similarly, $C_m$ decreases if a type
of size $m$ gains a new out-link, or is split, so long as the
point of splitting is not 1 or $m$. With a little consideration,
we can write down the expected change in $C_m$ at
the next step:

$$\langle \delta C_m \rangle = \frac{1}{t}\left[ (1-\gamma)(m-1)\,C_{m-1} + \gamma \sum_{r>m} r\,C_r\,\frac{2}{r} - (1-\gamma)\,m\,C_m - \gamma\,m\,C_m\left(1 - \frac{2}{m}\right) \right]. \tag{1}$$

The expected fraction of types that have degree $m$ is

$$f_m = \frac{\langle C_m(t) \rangle}{\langle k(t) \rangle}. \tag{2}$$

It follows from the definition that $\langle \delta C_m \rangle = \gamma f_m$.
Substituting back into (1) we find after collecting like terms
that

$$(1 + m - 2\gamma)\, f_m = (1-\gamma)(m-1)\, f_{m-1} + 2\gamma \sum_{r=m+1}^{\infty} f_r. \tag{3}$$

This equation is valid for all $m \geq 1$. Replacing $m$ by
$m - 1$ in (3), rearranging for the summation term and
substituting back into the original version of (3) gives
$f_m$ in terms of $f_{m-1}$ and $f_{m-2}$, and after calculating
$f_1$ and $f_2$ explicitly we find by induction the solution for
general $m$ to be

$$f_m = \gamma\,(1-\gamma)^{m-1}, \tag{4}$$

which can be written as an exponential $f_m = \frac{\gamma}{1-\gamma}\, e^{-\beta m}$
where $\beta = -\ln(1 - \gamma)$.
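
As a sanity check on the algebra, the following sketch substitutes the geometric form $f_m = \gamma(1-\gamma)^{m-1}$ into the balance equation $(1 + m - 2\gamma) f_m = (1-\gamma)(m-1) f_{m-1} + 2\gamma \sum_{r>m} f_r$ and evaluates the difference of the two sides numerically. The function names and the truncation of the infinite sum at a fixed cutoff are assumptions of this example.

```python
def f(m, gamma):
    """Out-degree distribution f_m = gamma * (1 - gamma)**(m - 1)."""
    return gamma * (1.0 - gamma) ** (m - 1)


def balance_residual(m, gamma, cutoff=2000):
    """Left side minus right side of the balance equation for f_m.

    The infinite tail sum is truncated at `cutoff`; the neglected terms
    decay geometrically and are far below floating-point precision for
    the gamma values used here.
    """
    tail = sum(f(r, gamma) for r in range(m + 1, cutoff))
    lhs = (1 + m - 2 * gamma) * f(m, gamma)
    rhs = (1 - gamma) * (m - 1) * f(m - 1, gamma) + 2 * gamma * tail
    return lhs - rhs


worst = max(
    abs(balance_residual(m, g))
    for g in (0.1, 0.3, 0.6)
    for m in range(1, 51)
)
total = sum(f(m, 0.3) for m in range(1, 2000))
print("largest residual:", worst)
print("sum of f_m for gamma = 0.3:", total)
```

A residual at the level of rounding error for every $m$ tested, together with a total of 1, is what one would expect if the geometric form is indeed the stationary solution.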

A similar calculation can be performed for the in-degree
distribution. The in-degree and out-degree of a
type are completely independent, so the selection of a
type for splitting is uniform with respect to in-degree. If
$g_n$, in analogy to $f_m$, is the fraction of types with
in-degree $n$, we find that

$$g_n \left[ 2 - \frac{2}{n} + (1-\gamma)\,n \right] = 2 \sum_{l>n} \frac{g_l}{l} + (1-\gamma)(n-1)\, g_{n-1}. \tag{5}$$

Using a similar method to before we find

$$g_n = \frac{(1-\gamma)\,n}{2 + (1-\gamma)\,n}\; g_{n-1}, \tag{6}$$

so that

$$g_n = \frac{\Gamma(n+1)\,\Gamma\!\left(2 + \frac{2}{1-\gamma}\right)}{\Gamma\!\left(n + 1 + \frac{2}{1-\gamma}\right)}\; g_1, \tag{7}$$

and normalization can be used to find $g_1$. For large $n$, the
ratio $g_n / g_{n-1}$ tends to $1 - \frac{2}{(1-\gamma)\,n}$, that is, the in-degree
distribution tends to a power-law of the form $g_n \propto n^{-\alpha}$
with exponent $\alpha = 2/(1 - \gamma)$. Thus the model predicts
a decaying exponential for the out-degree distribution,
and an in-degree distribution with a power-law tail with
exponent greater than or equal to 2. Examples of the
two distributions (4) and (7) are shown in Fig. 6.

FIG. 6: Examples of the degree distributions generated by the
model, with parameter $\gamma = 0.3$. Solid line is the out-degree
model, $f_m$, dashed line the in-degree model, $g_n$. On linear-
logarithmic axes, top, and double logarithmic axes, bottom.
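
The in-degree distribution can be tabulated directly from the one-step recursion $g_n = \frac{(1-\gamma)n}{2 + (1-\gamma)n}\, g_{n-1}$, fixing $g_1$ by normalization. The sketch below does this numerically and estimates the local tail exponent from the ratio $g_{2n}/g_n$. The function names and the truncation point `n_max` are assumptions of this example; the truncation slightly overestimates $g_1$, by an amount that shrinks as `n_max` grows.

```python
import math


def in_degree_distribution(gamma, n_max=100000):
    """Return a list g with g[n] the model in-degree probability (g[0] = 0).

    Built from the recursion g_n = n / (n + a) * g_{n-1}, a = 2 / (1 - gamma),
    then normalized over 1..n_max (a truncation of the infinite support).
    """
    a = 2.0 / (1.0 - gamma)
    g = [0.0, 1.0]                      # unnormalized, g[1] = 1
    for n in range(2, n_max + 1):
        g.append(g[-1] * n / (n + a))
    total = sum(g)
    return [x / total for x in g]


def local_exponent(g, n):
    """Estimate alpha from g_{2n} / g_n, assuming g_n ~ n**(-alpha)."""
    return -math.log(g[2 * n] / g[n]) / math.log(2.0)


gamma = 0.3
g = in_degree_distribution(gamma)
print("g_1 =", g[1])
print("local exponent at n = 5000:", local_exponent(g, 5000))
print("predicted exponent 2/(1-gamma):", 2.0 / (1.0 - gamma))
```

Working with the recursion rather than with ratios of Gamma functions avoids overflow at large $n$ and makes the numerical normalization a single sum.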



III. COMPARISON WITH MEASURED DATA 

The distributions (4) and (7) were fitted to the data
described in Section I using a maximum likelihood method,
which is asymptotically unbiased. Given some candidate
distribution $f(x, \gamma)$, the likelihood that the
histogrammed data $h_i$ at values $x_i$ was generated from this
distribution, given the parameter $\gamma$, is

$$p(\mathbf{h} \,|\, \gamma) = \prod_i f(x_i, \gamma)^{h_i} \tag{8}$$

and we proceed by finding the value of $\gamma$ which maximizes
this quantity. In the case of out metrics, $f(x_i, \gamma)$ is given
by (4), and the maximum likelihood estimator (MLE) of
$\gamma$ is found analytically to be

$$\hat{\gamma} = \frac{k}{t} \tag{9}$$

as expected, where $t = \sum_i x_i h_i$ is the number of edges
in the graph, and $k = \sum_i h_i$ is the number of nodes.
For in metrics, we use (7), for which we have not been
able to find a similar solution, so the MLE of $\gamma$ must be
found numerically. We again expect $\gamma = k/t$ because this
was assumed in the derivation of (7), although in some
cases this is not the best fit value. The explanation of
this is not known, though this very simple model is not
expected to explain every detail of the data.
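
For the out-degree model the log-likelihood is $k \ln\gamma + (t - k)\ln(1-\gamma)$, which is maximized at $\gamma = k/t$. The sketch below computes this estimate from a histogram and cross-checks it against a brute-force grid search. The histogram values are invented for illustration, the function names are hypothetical, and the sketch assumes all degrees are at least 1, as in the model.

```python
import math


def mle_gamma_out(hist):
    """Closed-form estimate gamma = k / t for the geometric out-degree model.

    hist maps a degree x (>= 1) to the number of types h with that degree.
    """
    k = sum(hist.values())                       # number of nodes
    t = sum(x * h for x, h in hist.items())      # number of edges
    return k / t


def log_likelihood_out(hist, gamma):
    """Log of prod_i f(x_i, gamma)**h_i with f = gamma * (1 - gamma)**(x - 1)."""
    return sum(
        h * (math.log(gamma) + (x - 1) * math.log(1.0 - gamma))
        for x, h in hist.items()
    )


# Hypothetical histogram: degree -> count (not taken from the corpus).
hist = {1: 50, 2: 30, 3: 12, 4: 5, 6: 3}

gamma_hat = mle_gamma_out(hist)
grid = [i / 1000 for i in range(1, 1000)]
gamma_grid = max(grid, key=lambda g: log_likelihood_out(hist, g))
print("closed form:", gamma_hat)
print("grid search:", gamma_grid)
```

A grid or other one-dimensional numerical search of this kind is also one way the in-degree fit could be carried out, with the geometric log-likelihood replaced by the logarithm of the in-degree model.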

The cumulative distribution derived from function (4)
using the best fit (MLE estimated) $\gamma$ value is plotted
along with the out metric data in Fig. 3. This figure
plots all of the out-metrics for the same program,
netbeans-4.1, and it can be seen that in the majority
of cases the agreement with the data is reasonably good,
even though the number of nodes in the graphs for different
metrics varies widely. Further examples are shown
in the top half of Fig. 5, which shows the same metric,
AC, for three different programs. Notice also that the
'wrong' model distribution (the predicted in-degree
distribution) does not fit as well. Similarly, the cumulative
best-fit functions (7) for each of the in metrics for
netbeans-4.1 are plotted in Fig. 4, and for three different
programs for the same metric (AP) in the bottom half of
Fig. 5. Again, many data sets show good agreement, and
the in-degree model fits the data much better than the
out-degree model (solid curve). Comparisons were made
for all the metrics and programs listed in Appendices A
and B, and Figs. 3, 4 and 5 are fairly representative. An
example of one of the better fits to an out metric data set
is shown in Fig. 7, and an example of a good in metric
fit in Fig. 8.

To obtain a quantitative measure of how well the model
fits the data, we follow the method of [12] and calculate
a p-value (a measure of the plausibility of the hypothesis
that the data were drawn from the proposed distribution)
based on the Kolmogorov-Smirnov (KS) statistic [13]

$$D = \max_x |S(x) - F(x)| \tag{10}$$

where $S(x)$ is the cumulative distribution function (CDF)
of the data and $F(x)$ is the CDF of the proposed distribution
(i.e. $F(x) = \sum_{y \le x} f(y)$). The fitted distribution
is correlated with the data, so treating it as the true
distribution will give a falsely high p-value. Instead we use
a Monte Carlo procedure, following [12]: a large number
of synthetic data sets is drawn from the best-fit distribution,
each one is fitted individually and the KS statistic
(relative to its own best fit distribution) calculated for
each of these fits. The p-value is then the fraction of
these KS statistic values that are larger than that found
for the original fit to the real data.
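
The Monte Carlo procedure just described can be sketched for the out-degree (geometric) model as follows. This is an illustrative outline under stated assumptions, not the code used for the results reported here: the function names, the number of synthetic data sets, and the inverse-transform sampler are choices made for the example, and the fitted $\gamma$ is assumed to lie strictly between 0 and 1.

```python
import math
import random
from collections import Counter


def fit_gamma(sample):
    """Closed-form fit of the geometric model: gamma = nodes / edges."""
    return len(sample) / sum(sample)


def ks_statistic(sample, gamma):
    """Max |S(x) - F(x)| with F(x) = 1 - (1 - gamma)**x for x = 1, 2, ..."""
    n = len(sample)
    counts = Counter(sample)
    cumulative = 0
    d = 0.0
    for x in range(1, max(sample) + 1):
        cumulative += counts.get(x, 0)
        d = max(d, abs(cumulative / n - (1.0 - (1.0 - gamma) ** x)))
    return d


def draw_geometric(gamma, rng):
    """Inverse-transform sample from f_m = gamma * (1 - gamma)**(m - 1)."""
    u = 1.0 - rng.random()               # uniform on (0, 1]
    return 1 + int(math.log(u) / math.log(1.0 - gamma))


def ks_p_value(sample, n_synthetic=200, seed=0):
    """Fraction of refitted synthetic data sets with a larger KS statistic."""
    rng = random.Random(seed)
    gamma_hat = fit_gamma(sample)
    d_observed = ks_statistic(sample, gamma_hat)
    exceed = 0
    for _ in range(n_synthetic):
        synthetic = [draw_geometric(gamma_hat, rng) for _ in sample]
        # Each synthetic set is compared with its own best-fit model.
        if ks_statistic(synthetic, fit_gamma(synthetic)) > d_observed:
            exceed += 1
    return exceed / n_synthetic


rng = random.Random(1)
geometric_sample = [draw_geometric(0.4, rng) for _ in range(300)]
print("p-value, geometric data:", ks_p_value(geometric_sample))
print("p-value, all degrees equal to 5:", ks_p_value([5] * 200))
```

Refitting every synthetic data set before computing its KS statistic is what corrects for the fitted distribution being correlated with the data.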



The p-values calculated for the in metrics and out metrics
are tabulated in Table I. We see that quantitatively,
the fits are not particularly good. The goodness of fit for
out metrics is particularly poor. The in metrics are often
reasonably well fit by the model, but where the fit fails, it
is sometimes because the degree one point is lower than
predicted (see, for example, DOinv for netbeans-4.1 in
Fig. 4). In other cases, the distribution appears to be a
more pure power law than the model function. Overall,
the heavy tail of the in-degree distribution predicted by
the model is successfully observed. The model fits to out
metric distributions are generally worse than those for
the in metrics and often have a more complicated shape,
supporting the hypothesis that the difference between the
distributions is the direct programmer intervention in the
out-degrees of types. The KS statistic is the maximum
difference between the model and data cumulative
distributions, and we see from Fig. 3 that although the shapes
are similar, there is separation between the two curves
over part of the range in several cases, destroying the
quantitative goodness of fit. Nevertheless, the model
appears to fit many of the curves well or be qualitatively in
agreement for much of the range.

TABLE I: Number of p-values greater than 0.05 and 0.1 for
the out-degree and in-degree data sets.

               # p-values > 0.05    # p-values > 0.1
out-metrics    4/101 (4%)           4/101 (4%)
in-metrics     9/44 (20%)           8/44 (18%)

FIG. 7: AC for columba-1.0, an example of an out metric
distribution that is well fitted by the model.

FIG. 8: AP for derby-10.1.1.0, an example of an in metric
distribution that is well fitted by the model.

Two assumptions of the model are that types gain
edges at a rate proportional to the existing number of
edges, and that when a type is subdivided the
redistribution of edges between the two resulting types is
uniform. If the selection of nodes to which new edges are
attached is uniform rather than linear, the resulting
in-degree distribution has an exponential tail, which is not
the case in real software graphs, so it appears that it is
necessary that the rate of attachment of new edges to a
node should depend on the existing degree of the node,
though the explanation for this is not as clear as in the
case of wealth distribution or paper citation rates.

In the current model, new types are created by removing
references (edges) from an existing type and placing
them in a new type. It is equally plausible that a
programmer instead copies references, creating duplicates
of edges. If this is the case, and none of the edges are
transferred to the new node, the out-degree distribution
has a 'fatter' tail than the original exponential distribution,
but the 'head' of the distribution also becomes much
steeper, meaning that this model fits the out-degree
distribution data very poorly. Alternatively, if the edges are
distributed between the two nodes according to a binomial
distribution (so that each edge is equally likely to
be attached to either node), the resulting out-degree
distribution is similar to an exponential distribution,
and the in-degree distribution has a power-law
tail with a minimum exponent of 3. This is also
incompatible with the data, which typically have a power-law
tail with exponent close to 2.



IV. DISCUSSION 

In this paper we have described a simple model of 
the generation of software graph degree distributions 
based on the assumption that the process of program- 
ming involves programmers making active choices about 
the structure of the type on which they are working - 
in particular they are conscious of the 'size' of the type, 
and this comes to be reflected in the resulting out-degree, 
while the in-degree of a type emerges indirectly as a result 
of the construction of other types. The only difference 
between incoming and outgoing edges in the microscopic 
process of the model is that the splitting operation on 
types is dependent on type out-degrees and is independent
of in-degrees. This model is extremely simple, and
depends on a single parameter $\gamma$ which can be physically
interpreted as the reciprocal of the average degree of the
graph. This suggests that the mean number of dependencies
per type (both incoming and outgoing) may prove to
be a useful statistic for comparing different type relationships
and different programs. The model reproduces the
approximate shapes of in-degree and out-degree distributions
for a range of graph construction metrics applied
to a variety of Java programs. This suggests that this
shape is due to simple statistical processes common to
all software graphs, so that any differences due to different
design methodologies and so on must be sought
in the details of the deviations from this mean behavior.
The agreement of the proposed model with the measured
data is not perfect, however. These and other difficulties,
such as variations in the shapes of distributions between
programs and between metrics, indicate that there
are higher-order effects that remain to be described.

Degree distribution is one of the most accessible mea- 
sures of the 'shape' of a network, but there are many 
more measures such as degree correlation, clustering coef- 
ficients and so on that can be calculated. Further analysis 
of real data sets for such more detailed measures would 
help to discriminate between and to refine the various 
generative models that have so far been proposed. An- 
other avenue of investigation that would be particularly 
fruitful would be a statistical analysis of the program- 
ming process as it happens, in order to identify the rates 
and probabilities of various actions, which could then be 
compared with the assumptions of the model proposed 
in this paper.

APPENDIX A: METRICS

Brief descriptions of the metrics used in the study
are listed below. Following [1], short codes are used
throughout the text to refer to each metric. More
complete descriptions of the metrics and of how the data
was extracted from the source code are given in the cited work.

Some of the metrics are paired, with one representing 
the 'in' degree (listed first) and another representing the 
reciprocal 'out' degree. In general an 'out' metric for a 
type counts things that would appear in the code for that 
type, while 'in' metrics count references to a type which 
appear in the code for other types. Some of the metrics 
have no reciprocal metric, but can still be identified as 
measuring either 'in' or 'out' degree. The remaining metrics
used in [1] were not included in the analysis as their
in/out status was not clear or they were a mixture of in 
and out measures. 



'In' metrics 

AP References to class as a member. For a given
type, the number of top-level types (including itself)
in the source that have a field of that type.

DOinv Depends On inverse. For a given type, the 
number of type implementations in which it ap- 
pears in their source. 

PP References to class as a parameter. For a given 
type, the number of top-level types in the source 
that declare a method with a parameter of that 
type. 

RP References to class as return type. For a given 
type, the number of top-level classes in the source 
that declare a method with that type as the return 
type. 

SP Subclasses. For a given class, the number of top- 
level classes that specify that class in their extends 
clause. 



'Out' metrics 

AC Members of class type. For a given type, the size 
of the set of types of fields for that type. 

DO Depends on. For a given type, the number of top- 
level types in the source that it needs in order to 
compile. 

PC Parameter-type class references. For a given 
type, the size of the set of types used as parameters 
in methods for that type. 

PubMC Public Method Count. The number of meth- 
ods in a type with public access type. 

RC Methods returning classes. For a given type, the 
size of the set of types used as return types for 
methods in that type. 

nC Number of Constructors. For a given class, the 
number of constructors of all access types declared 
in the class. 

nF Number of Fields. For a given type, the number 
of fields of all access types declared in the type. 

nM Number of Methods. For a given type, the num- 
ber of all methods of all access types (that is, pub- 
lic, protected, private, package private) declared 
(that is, not inherited) in the type. 

pkgSize Package Size. The number of types contained
directly in a package (and not contained
in sub-packages).

APPENDIX B: APPLICATIONS

The programs studied, their size (number of classes),
domain and where they were sourced are:


Application             # Classes   Domain                                        Origin           Notes
eclipse-SDK-3.1-win32   11413       IDE                                           www.eclipse.org  Donated by IBM
netbeans-4.1            8406        IDE                                           netbeans.org     Donated by Sun
jre-1.4.2.04            7257        JRE                                           sun.com
jboss-4.0.3-SP1         4143        J2EE server                                   Sourceforge
openoffice-2.0.0        2925        Office suite                                  openoffice.org   Donated by Sun
jtopen-4.9              2857        Java toolbox for iSeries and AS/400 servers   Sourceforge      Donated by IBM
geronimo-1.0-M5         1719        J2EE server                                   Apache
azureus-2.3.0.4         1650        P2P filesharing                               Sourceforge
derby-10.1.1.0          1386        SQL database                                  Apache Jakarta   Donated by IBM
compiere-251e           1372        ERP and CRM                                   Sourceforge
argouml-0.18.1          1251        UML drawing/critic                            tigris.org
columba-1.0             1180        Email client                                  Sourceforge